Evaluating Content Extraction on HTML Documents

Author

Thomas Gottron

Abstract

A variety of applications use methods to determine and extract the main textual content of an HTML document. The performance of the methods employed in this task is rarely evaluated. This paper fills this gap by introducing a platform-independent and extensible framework for measuring, evaluating and comparing the performance of methods for Content Extraction. We further give an overview of extraction algorithms found in domain-specific applications and present an adaptation of a related algorithm to perform Content Extraction. We compare the algorithms using the developed framework and show that our adapted algorithm performs best on most HTML documents.
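
The abstract does not say which measures the framework computes. As an illustration only, one common way to score an extractor is to compare its output with a hand-labelled gold-standard text and report precision, recall and F1 over words. The sketch below assumes whitespace tokenization and bag-of-words overlap; the function name `score_extraction` is hypothetical and is not taken from the paper.

```python
from collections import Counter


def score_extraction(extracted: str, gold: str) -> tuple[float, float, float]:
    """Return (precision, recall, F1) of extracted text against gold text.

    Assumption: both texts are compared as multisets of whitespace-separated
    words. This is an illustrative choice, not the paper's stated method.
    """
    extracted_words = Counter(extracted.split())
    gold_words = Counter(gold.split())

    # Multiset intersection: each word counts as often as it appears in both.
    overlap = sum((extracted_words & gold_words).values())

    precision = overlap / sum(extracted_words.values()) if extracted_words else 0.0
    recall = overlap / sum(gold_words.values()) if gold_words else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # The extractor kept one word of navigation clutter and missed two gold words.
    p, r, f = score_extraction("main text menu", "main text here now")
    print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

A bag-of-words comparison ignores word order, so an order-sensitive measure may be preferable when the sequence of the extracted text matters.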

Similar resources

Efficient Text Content Extraction and Browsing of WWW Documents Using the Abstract Text Viewer

The Abstract Text Viewer (ATV) is an integrated suite of text reading tools for electronic documents designed to increase efficiency and effectiveness of content extraction. ATV reads an HTML formatted document to create more abstract representations, such as a heading structure for overviews. The system uses both well-known techniques for text representation and novel display and content extrac...

Information Extraction from HTML Documents Based on Logical Document Structure

The World Wide Web presents the largest Internet source of information from a broad range of areas. The web documents are mostly written in the Hypertext Markup Language (HTML) that doesn’t contain any means for semantic description of the content and thus the contained information cannot be processed directly. Current approaches for the information extraction from HTML are mostly based on wrap...

Robust Web Data Extraction with XML Path Expressions

Automated extraction of structured Web data has attracted considerable interest in both academia and industry. A particularly promising approach is to employ XML technologies to translate semi-structured HTML documents to “pure” XML documents. In this approach, HTML documents are first normalized into XHTML and then mapped to the desired XML application format by using XML path expressions ...

Optimized Content Extraction from web pages using Composite Approaches

The information available today on the web is tremendous and comes with greater challenges. Content extraction identifies the main content and removes the clutter from web pages. The main problem in extracting the content from a web page is the newer architecture of web pages and the diversity in their structure. Optimized content extraction from HTML documents using collective approac...

Kitten: a tool for normalizing HTML and extracting its textual content

The web is composed of a gigantic amount of documents that can be very useful for information extraction systems. Most of them are written in HTML and have to be rendered by an HTML engine in order to display the data they contain on a screen. HTML files thus mix both informational and rendering content. Our goal is to design a tool for informational content extraction. A linear extraction with...

Publication date: 2007
